A Bayesian Spatial Scan Statistic for Under-reported Data

August 14, 2025

Introduction

Public Health Surveillance

Public health surveillance
The systematic, ongoing assessment of the health of a community, including the timely collection, analysis, interpretation, dissemination, and subsequent use of data.

Outbreak Detection

A subset of disease surveillance methods focuses on disease progression and outbreak detection.

Novel disease monitoring

New diseases often lack reliable testing and reporting systems. Early cases may be missed or misclassified, undermining disease surveillance techniques that assume complete case counts.

Examples

  • COVID-19
  • HIV/AIDS
  • Tuberculosis (TB)

Accounting for Under-reporting

Most methods proposed for modeling under-reported or misclassified data fall into two categories:

  1. Double sampling
  2. Latent variable models

Spatial Scan Statistics

General Concept

Scan statistics

  1. Select candidate regions
  2. Calculate relative risk inside and outside of candidate region
  3. Determine region with largest difference
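The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how candidate regions are chosen, so zones are built here as nearest-neighbor sets around each centroid, and `circular_zones`, `relative_risk`, and the toy data are hypothetical.

```python
import math

def circular_zones(centroids, max_size):
    """Step 1: each zone is a region plus its nearest neighbors (assumed scheme)."""
    zones = set()
    for xi, yi in centroids:
        # Order all regions by distance from this centroid (the region itself is first).
        order = sorted(
            range(len(centroids)),
            key=lambda j: math.hypot(centroids[j][0] - xi, centroids[j][1] - yi),
        )
        for k in range(1, max_size + 1):
            zones.add(tuple(sorted(order[:k])))
    return sorted(zones)

def relative_risk(zone, cases, baseline):
    """Step 2: rate inside the zone divided by the rate outside it.

    Assumes the zone is a proper subset of the map, so the outside baseline is positive.
    """
    c_in = sum(cases[i] for i in zone)
    b_in = sum(baseline[i] for i in zone)
    c_out = sum(cases) - c_in
    b_out = sum(baseline) - b_in
    rate_in, rate_out = c_in / b_in, c_out / b_out
    return math.inf if rate_out == 0 else rate_in / rate_out

# Toy data (hypothetical): three regions on a line, the third with many more cases.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
cases = [2, 3, 20]
baseline = [100, 100, 100]

zones = circular_zones(centroids, max_size=2)
# Step 3: keep the zone whose inside rate most exceeds the outside rate.
best_zone = max(zones, key=lambda z: relative_risk(z, cases, baseline))
```

Ranking by a raw rate ratio is the simplest possible criterion; the frequentist and Bayesian statistics that follow replace it with a likelihood ratio and a posterior probability, respectively.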

Visualization of Spatial Scan Regions

Frequentist Spatial Scan Statistic

  • The framework assumes that we observe counts \(z_i\) such that \(z_i \sim \text{Poisson}(qb_i)\)
    • Where \(b_i\) represents the known baseline/at risk population of cell \(S_i\)
    • \(q\) is the unknown underlying disease rate

\[ H_0: \text{No cluster (common rate for all regions)} \\ H_1(S): \text{Cluster in subset }S\text{ with elevated rate vs. outside } S \]

  • Compute likelihood ratio test statistic for each candidate zone \(S\)
  • The scan test statistic is \(\Lambda = \max_{S \in C}\lambda(S)\), where \(C\) is the set of candidate zones.
  • Generate Monte Carlo samples under \(H_0\) to calculate P-value
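A minimal sketch of this procedure follows. The slides do not give the form of \(\lambda(S)\), so the code assumes the usual Poisson likelihood ratio conditional on the total case count, and it simulates null data by allocating the observed total to cells in proportion to \(b_i\). The function names and the toy data are hypothetical.

```python
import math
import random

def log_lr(c_in, b_in, c_tot, b_tot):
    """Log likelihood ratio for one zone (assumed Poisson form, elevated rates only)."""
    c_out, b_out = c_tot - c_in, b_tot - b_in
    if c_in / b_in <= c_out / b_out:
        return 0.0  # the inside rate is not elevated, so no evidence of a cluster

    def term(c, b):
        return c * math.log(c / b) if c > 0 else 0.0

    return term(c_in, b_in) + term(c_out, b_out) - term(c_tot, b_tot)

def scan_statistic(cases, baseline, zones):
    """Return (max log likelihood ratio, zone attaining it)."""
    c_tot, b_tot = sum(cases), sum(baseline)
    scores = [
        log_lr(sum(cases[i] for i in z), sum(baseline[i] for i in z), c_tot, b_tot)
        for z in zones
    ]
    best = max(range(len(zones)), key=scores.__getitem__)
    return scores[best], zones[best]

def monte_carlo_p_value(cases, baseline, zones, n_sim=999, seed=0):
    """Share of null replicates whose scan statistic reaches the observed value."""
    rng = random.Random(seed)
    observed, _ = scan_statistic(cases, baseline, zones)
    n, c_tot = len(cases), sum(cases)
    exceed = 0
    for _ in range(n_sim):
        # Under H0 each case lands in cell i with probability b_i / sum(b).
        sim = [0] * n
        for i in rng.choices(range(n), weights=baseline, k=c_tot):
            sim[i] += 1
        if scan_statistic(sim, baseline, zones)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)

# Toy data (hypothetical): four cells, candidate zones are proper subsets of the map.
cases = [2, 3, 20, 4]
baseline = [100, 100, 100, 100]
zones = [(0,), (1,), (2,), (3,), (0, 1), (2, 3)]

stat, zone = scan_statistic(cases, baseline, zones)
p_value = monte_carlo_p_value(cases, baseline, zones, n_sim=199)
```

The maximum is recomputed over all zones in every replicate, which is what lets the single p-value account for the multiple candidate zones.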

Bayesian Spatial Scan Statistics

  • Assume we observe count data \(z_i\) in area \(i\), each associated with a baseline \(b_i\)
  • Under the null hypothesis there is no cluster and all locations share a common rate \(q_{all}\) \[ z_i \sim \text{Poisson}(q_{all} b_i), \quad q_{all} \sim \text{Gamma}(\alpha_{all}, \beta_{all})\]
  • The alternative hypothesis \(H_1(S)\) is defined for each candidate cluster \(S \in \mathcal{S}\), where \(\mathcal{S}\) is the space of all possible clusters \[ \begin{cases} z_i \sim \text{Poisson}(q_{in} b_i), &i \in S, \quad q_{in} \sim \text{Gamma}(\alpha_{in}, \beta_{in}), \\ z_i \sim \text{Poisson}(q_{out} b_i), &i \notin S, \quad q_{out} \sim \text{Gamma}(\alpha_{out}, \beta_{out}). \end{cases} \]
  • Marginal likelihoods are based on the gamma-Poisson model
  • Because the model is conjugate, the marginal likelihoods have closed-form solutions
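As a worked illustration of the closed form, assume the gamma priors use the shape-rate parameterization (the slides do not state which one is used). For any set of cells \(A\) sharing one rate \(q \sim \text{Gamma}(\alpha, \beta)\), write \(C_A = \sum_{i \in A} z_i\) and \(B_A = \sum_{i \in A} b_i\); these two symbols are introduced here for convenience. Integrating out \(q\) gives

\[ P(D_A) = \int_0^\infty \prod_{i \in A} \frac{(q b_i)^{z_i} e^{-q b_i}}{z_i!} \, \frac{\beta^{\alpha}}{\Gamma(\alpha)} q^{\alpha - 1} e^{-\beta q} \, dq = \left( \prod_{i \in A} \frac{b_i^{z_i}}{z_i!} \right) \frac{\beta^{\alpha} \, \Gamma(\alpha + C_A)}{\Gamma(\alpha) \, (\beta + B_A)^{\alpha + C_A}} \]

Under the null, \(A\) is the whole map with \((\alpha_{all}, \beta_{all})\). Under the alternative for a candidate cluster \(S\), the marginal likelihood is the product of this expression over the cells inside \(S\) with \((\alpha_{in}, \beta_{in})\) and over the cells outside \(S\) with \((\alpha_{out}, \beta_{out})\). The leading product over \(b_i^{z_i} / z_i!\) is the same under every hypothesis, so it cancels when hypotheses are compared.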

Bayesian Spatial Scan Statistic Testing

  • Using the marginal likelihoods from the models, the posterior probability of the null is \[P(H_0 | D) = \frac{P(D|H_0) P(H_0)}{P(D)}\]
  • The posterior probability of the alternative for candidate cluster \(S\) is \[P(H_1(S) | D) = \frac{P(D|H_1(S)) P(H_1(S))}{P(D)}\]
  • We can then return the regions with non-negligible posterior probabilities
  • Since we have the full posterior probability distribution over the hypotheses, there is no need for randomization testing
  • Bayes factors provide a direct measure of evidence for one hypothesis over the other
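The posterior calculation can be sketched as follows. The code assumes shape-rate gamma priors and spreads the prior mass \(1 - P(H_0)\) uniformly over the candidate zones; neither choice is stated in the slides, and the function names, prior values, and data are hypothetical. The marginal likelihood drops the factor \(\prod_i b_i^{z_i} / z_i!\), which is identical under every hypothesis and cancels in the ratio.

```python
import math

def log_marginal(c, b, alpha, beta):
    """Log gamma-Poisson marginal for total count c and total baseline b.

    The hypothesis-independent factor prod(b_i**z_i / z_i!) is omitted.
    """
    return (alpha * math.log(beta) + math.lgamma(alpha + c)
            - math.lgamma(alpha) - (alpha + c) * math.log(beta + b))

def posterior_probabilities(cases, baseline, zones, prior_null, priors):
    """Return (P(H0 | D), [P(H1(S) | D) for S in zones]).

    priors maps 'all', 'in', 'out' to (alpha, beta) pairs.
    """
    c_tot, b_tot = sum(cases), sum(baseline)
    log_joint = [math.log(prior_null) + log_marginal(c_tot, b_tot, *priors["all"])]
    log_prior_zone = math.log((1 - prior_null) / len(zones))  # uniform over zones (assumed)
    for z in zones:
        c_in = sum(cases[i] for i in z)
        b_in = sum(baseline[i] for i in z)
        log_joint.append(
            log_prior_zone
            + log_marginal(c_in, b_in, *priors["in"])
            + log_marginal(c_tot - c_in, b_tot - b_in, *priors["out"])
        )
    # Normalize in log space; the sum of the joint terms plays the role of P(D).
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return weights[0] / total, [w / total for w in weights[1:]]

# Toy data and priors (hypothetical values chosen only for illustration).
cases = [2, 3, 20, 4]
baseline = [100, 100, 100, 100]
zones = [(0,), (1,), (2,), (3,)]
priors = {"all": (1.0, 10.0), "in": (1.0, 10.0), "out": (1.0, 10.0)}

p_null, p_zones = posterior_probabilities(cases, baseline, zones, 0.95, priors)
```

Because every hypothesis receives an explicit posterior probability, the output can be reported directly as a ranked list of zones with no Monte Carlo replicates.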

Bayesian Interpretation

Interpretation of Bayes factor

| BF | Log(BF) | Strength of evidence against \(H_0\) |
|---|---|---|
| 1 to 3.2 | 0 to 1.16 | Not Significant |
| 3.2 to 10 | 1.16 to 2.30 | Positive |
| 10 to 100 | 2.30 to 4.61 | Strong |
| \(>\) 100 | \(> 4.61\) | Decisive |
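A small helper can turn a computed Bayes factor into the verbal categories of the table. The thresholds are copied from the table; treating each upper bound as inclusive is an assumption, and the function names are hypothetical. The Log(BF) column corresponds to the natural logarithm, since \(\ln 3.2 \approx 1.16\), \(\ln 10 \approx 2.30\), and \(\ln 100 \approx 4.61\).

```python
import math

# Upper bounds and labels taken from the table; inclusive bounds are assumed.
_CATEGORIES = [(3.2, "Not Significant"), (10.0, "Positive"), (100.0, "Strong")]

def evidence_category(bayes_factor):
    """Map a Bayes factor of at least 1 to its verbal category."""
    if bayes_factor < 1:
        raise ValueError("Bayes factor below 1: invert the ratio before using the table")
    for upper, label in _CATEGORIES:
        if bayes_factor <= upper:
            return label
    return "Decisive"

def evidence_from_log(log_bf):
    """Same mapping, starting from a natural-log Bayes factor."""
    return evidence_category(math.exp(log_bf))
```

Working on the log scale is usually more convenient in practice, because marginal likelihoods for count data are computed as log values to avoid underflow.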

Scan Statistic Timeline

timeline
    title Spatial Scan Statistic Development
    1965 : Conceptual basis - Naus
    1997 : Basic Spatial Scan Statistic (Frequentist)
    1998 : Space-Time Extension (Frequentist)
    2005 : Flexible Shapes (Frequentist)
    2005 : Bayesian Spatial Scan Statistic
    2007 : Multivariate Spatial Scan Statistic (Frequentist)
    2012 : Overdispersed data extension (Frequentist) 
    2017 : Bayesian Spatial Scan Statistic for Zero-inflated count data
    2018 : Wald-based Spatial Scan Statistics (Frequentist)
    2024 : Bayesian Spatial Scan Statistic for Multinormal data

  • Since their formalization in 1997, spatial scan statistics have been used and described as a tool for epidemiologists
  • No existing extension accounts for under-reported count data

Proposed Method

Model

  • We propose a novel Bayesian spatial scan statistic that models the true counts as a latent variable and introduces a reporting probability \(p\).
  • Our spatial scan statistic is based on the hierarchical model \[ z_i \sim \text{Poisson}(p \times q \times b_i) \\ q \sim \text{Gamma}(\alpha, \beta) \\ p \sim \text{Beta}(\alpha_p, \beta_p) \]
  • The model is no longer conjugate
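One way to see where the factor \(p\) comes from is Poisson thinning. Let \(y_i\) denote the unobserved true count in cell \(i\); this symbol is introduced here for illustration and does not appear in the slides. If each true case is reported independently with probability \(p\), then

\[ y_i \mid q \sim \text{Poisson}(q b_i), \qquad z_i \mid y_i, p \sim \text{Binomial}(y_i, p) \]

and summing over the latent count gives

\[ P(z_i \mid p, q) = \sum_{y = z_i}^{\infty} \frac{(q b_i)^{y} e^{-q b_i}}{y!} \binom{y}{z_i} p^{z_i} (1 - p)^{y - z_i} = \frac{(p q b_i)^{z_i} e^{-p q b_i}}{z_i!} \]

so that \(z_i \mid p, q \sim \text{Poisson}(p \times q \times b_i)\). The likelihood depends on \(p\) and \(q\) only through their product, which is why the data alone cannot separate the two.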

Bayesian Spatial Scan Statistic Extension

  • The new null hypothesis assumes no clusters \[ z_i \sim \text{Poisson}(p \times q_{all} \times b_i), \quad q_{all} \sim \text{Gamma}(\alpha_{all}, \beta_{all}), \quad p \sim \text{Beta}(\alpha_p, \beta_p) \]
  • The resulting alternative hypothesis for candidate cluster \(S\) is \[ \begin{cases} z_i \sim \text{Poisson}(p \times q_{in} \times b_i), &i \in S, \quad q_{in} \sim \text{Gamma}(\alpha_{in}, \beta_{in}), \\ z_i \sim \text{Poisson}(p \times q_{out} \times b_i), &i \notin S, \quad q_{out} \sim \text{Gamma}(\alpha_{out}, \beta_{out}). \end{cases} \\ p \sim \text{Beta}(\alpha_p, \beta_p) \]
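The slides do not say how the marginal likelihoods of these hypotheses are computed. One possible approach, sketched below as an assumption rather than the authors' method, is to integrate each rate out analytically for a fixed \(p\) (given \(p\), the model is gamma-Poisson with baseline \(p \times b_i\)) and then integrate over the shared \(p\) numerically. Shape-rate gamma priors are assumed, and all names and values are hypothetical.

```python
import math

def log_gamma_poisson(c, b, alpha, beta):
    """Log gamma-Poisson marginal for total count c and total baseline b."""
    return (alpha * math.log(beta) + math.lgamma(alpha + c)
            - math.lgamma(alpha) - (alpha + c) * math.log(beta + b))

def log_beta_pdf(p, a, b):
    """Log density of a Beta(a, b) distribution at p."""
    return ((a - 1) * math.log(p) + (b - 1) * math.log(1 - p)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def log_marginal_underreported(groups, prior_p, n_grid=2000):
    """Log marginal likelihood with a reporting probability p shared by all cells.

    groups holds one (count, baseline, alpha, beta) tuple per rate parameter:
    a single tuple for H0, and an inside and an outside tuple for H1(S).
    The hypothesis-independent factor prod(b_i**z_i / z_i!) is omitted.
    """
    a_p, b_p = prior_p
    log_terms = []
    for k in range(n_grid):
        p = (k + 0.5) / n_grid  # midpoint rule on (0, 1)
        value = log_beta_pdf(p, a_p, b_p)
        for c, b, alpha, beta in groups:
            # Given p, the baseline is p * b; the factor p**c comes from (p * b_i)**z_i.
            value += c * math.log(p) + log_gamma_poisson(c, p * b, alpha, beta)
        log_terms.append(value)
    # Sum the grid values in log space, then scale by the grid width 1 / n_grid.
    m = max(log_terms)
    return m + math.log(sum(math.exp(v - m) for v in log_terms)) - math.log(n_grid)

# Toy comparison (hypothetical values): 20 of 29 cases fall in one of four equal cells.
prior_p = (7.0, 55.0)
log_m0 = log_marginal_underreported([(29, 400.0, 1.0, 1.0)], prior_p)
log_m1 = log_marginal_underreported(
    [(20, 100.0, 1.0, 1.0), (9, 300.0, 1.0, 1.0)], prior_p
)
log_bayes_factor = log_m1 - log_m0
```

Because \(p\) is shared by the cells inside and outside the zone, the inside and outside terms are multiplied before integrating over \(p\), not integrated separately.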

Setting Priors

  • An informative prior on the reporting rate \(p\) is necessary, since the likelihood depends on \(p\) and the disease rate only through their product
    • Historical data
    • Expert elicitation
  • A diffuse prior can be placed on the \(q\) parameters

Simulation Study

Simulation Design

  • The 39 counties of Washington State, with a simulated outbreak in 3 counties in southeastern Washington.

Simulation Metrics

Even when the null hypothesis is correctly rejected, the detected clusters rarely match the true cluster exactly.

To evaluate how well they overlap we will use:

  • Sensitivity: Proportion of cases in the true cluster that are included in the detected cluster
  • Positive Predictive Value (PPV): Proportion of cases in the detected cluster that are actually in the true cluster
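Both metrics reduce to overlap calculations between the detected and true clusters. The sketch below treats a cluster as a set of region labels and optionally weights each region, for example by its case count. The function name, the labels, and the convention of returning a PPV of 0 when nothing is detected are assumptions made for illustration.

```python
def sensitivity_ppv(detected, true_cluster, weights=None):
    """Return (sensitivity, PPV) for a detected cluster against the true one.

    detected and true_cluster are iterables of region labels. weights, if
    given, maps each label to a non-negative weight such as its case count;
    otherwise every region counts equally.
    """
    detected, true_cluster = set(detected), set(true_cluster)

    def weight(region):
        return 1 if weights is None else weights[region]

    overlap = sum(weight(r) for r in detected & true_cluster)
    sensitivity = overlap / sum(weight(r) for r in true_cluster)
    # An empty detection has no positives; PPV is reported as 0 here (assumed convention).
    ppv = overlap / sum(weight(r) for r in detected) if detected else 0.0
    return sensitivity, ppv

# Hypothetical labels: the detected cluster covers two of the three true regions.
sens, ppv = sensitivity_ppv({"A", "B", "C", "D"}, {"B", "C", "E"})
```

A detected cluster that swallows the whole map would score a sensitivity of 1 but a low PPV, which is why the two metrics are reported together.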

Simulation Results Visual

Data Application

Texas COVID-19 Data

  • COVID-19 data in early 2020 were severely under-reported due to limited testing and difficulty of diagnosis (Hortaçsu, Liu, and Schwieg 2021)
  • Data (254 Counties)
    • COVID-19 cases (Probable and Confirmed)
    • Population

Real Data (priors)

  • Estimates from early COVID-19 studies suggest very low reporting rates (\(\approx 10\%\)), with a low probability of exceeding \(30\%\) (Chen, Song, and Stamey 2022).
  • This information results in a prior of \(p \sim \text{Beta}(7, 55)\)
  • Diffuse priors were placed on the \(q_\cdot\) parameters

\[ q_{all} \sim \text{gamma}() \\ q_{out} \sim \text{gamma}() \\ q_{in} \sim \text{gamma}() \]
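As a quick check of the reporting-rate prior, the mean of \(\text{Beta}(7, 55)\) and the prior probability that \(p\) exceeds 30% can be computed with the standard library alone. The tail probability is approximated by a midpoint rule, and the helper names are hypothetical.

```python
import math

def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def beta_tail(threshold, a, b, n=20000):
    """Approximate P(p > threshold) with the midpoint rule on (threshold, 1)."""
    h = (1 - threshold) / n
    return h * sum(beta_pdf(threshold + (k + 0.5) * h, a, b) for k in range(n))

a, b = 7, 55
prior_mean = a / (a + b)      # 7 / 62, close to the 10% reporting rate quoted above
tail = beta_tail(0.30, a, b)  # prior probability that reporting exceeds 30%

print(f"prior mean  = {prior_mean:.3f}")
print(f"P(p > 0.30) = {tail:.2e}")
```

The same two summaries can be recomputed for alternative shape parameters when judging how sensitive the results are to the elicited prior.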

Real Data Results

The two methods identify different most likely clusters:

  • Naive: Around the city of Houston
  • Under-reported: Around El Paso and north of DFW

The Bayes factor for each identified cluster is very large, indicating strong evidence in favor of \(H_1\) over \(H_0\).

Discussion

  • Traditional scan statistics may fail when case counts are under-reported, which is common in emerging outbreaks
  • The proposed method models reporting probability, improving cluster detection under incomplete data
  • Comparison with confirmed cases suggests that some true clusters (Texas Panhandle) remain undetected, indicating that further refinement is needed

Future work

  • Extend to spatiotemporal model for real-time detection
  • Incorporate multivariate outcomes
  • Allow spatially varying rates to reflect local testing access

Bibliography